PROBLEM TO BE SOLVED: To retrieve speech data similar to a retrieval key at high speed by shortening
the time needed to calculate a similarity distance, using simple matching between fixed-length data
in retrieval of the speech data.

SOLUTION: A time window segmentation part 12 extracts partial speech time-series feature quantities
of different lengths from an audio signal to be retrieved by using time windows of a plurality of
lengths. A partial speech time-series feature quantity length fixation part 13 linearly expands or
contracts the partial speech time-series feature quantities to a specified reference time window
length and stores them in a speech storage part 14. In retrieval, a retrieval key speech time-series
feature quantity extraction part 18 extracts a retrieval key speech time-series feature quantity
vector with the length of the reference time window from the input speech signal of the retrieval
key, and a feature quantity information comparison part 16 calculates the similarity distance
between the extracted vector and each speech time-series feature quantity vector to be retrieved,
and decides on vectors having higher similarity as the retrieval result.

* NOTICES * 

JPO and INPIT are not responsible for any 
damages caused by the use of this translation. 

1. This document has been translated by computer. So the translation may not reflect the original precisely. 

2. **** shows the word which can not be translated. 
3. In the drawings, any words are not translated. 



CLAIMS 



[Claim(s)] 

[Claim 1]An audio signal search method comprising: a process in which partial tone voice time series
characteristic quantities of different lengths are extracted from an audio signal used as a retrieval
object, using time windows of two or more kinds of length; a process in which the extracted partial tone
voice time series characteristic quantities of two or more kinds of length are linearly expanded or
contracted and aligned to the length of a base period window used as a standard for similar distance
calculation at the time of search; a process in which the partial tone voice time series characteristic
quantities aligned to the length of said base period window are accumulated as voice time series feature
amount vectors used as retrieval objects, for comparison with a voice time series feature amount vector
obtained from an audio signal inputted as a search key; a process in which a search key voice time series
feature amount vector of the length of said base period window is extracted from an audio signal inputted
or specified as a search key; a process in which the similar distance between each voice time series
feature amount vector accumulated as said retrieval object and said search key voice time series feature
amount vector is calculated, and the similarity between the audio signal of the search key and the audio
signal of each audio signal section which is a retrieval object is computed; and a process in which search
results are outputted based on the computed similarity.
[Claim 2]An audio signal search method comprising: a process in which partial tone voice time series
characteristic quantity is extracted from an audio signal used as a retrieval object, using a base period
window of the length used as a standard for similar distance calculation at the time of search; a process
in which the extracted partial tone voice time series characteristic quantity is accumulated as voice time
series feature amount vectors used as retrieval objects, for comparison with a voice time series feature
amount vector obtained from an audio signal inputted as a search key; a process in which partial tone voice
time series characteristic quantities of different lengths are extracted from an audio signal inputted or
specified as a search key, using time windows of two or more kinds of length; a process in which the
extracted partial tone voice time series characteristic quantities of two or more kinds of length are
linearly expanded or contracted and aligned to the length of said base period window; a process in which
the partial tone voice time series characteristic quantities aligned to the length of said base period
window are used as search key voice time series feature amount vectors, the similar distance between the
voice time series feature amount vectors accumulated as said retrieval object and said search key voice
time series feature amount vectors is calculated, and the similarity between the audio signal of the search
key and the audio signal of each audio signal section which is a retrieval object is computed; and a
process in which search results are outputted based on the computed similarity.

[Claim 3]The audio signal search method according to claim 1 or 2, wherein, in the process in which partial
tone voice time series characteristic quantities of different lengths are extracted from said audio signal
using time windows of two or more kinds of length, voice time series characteristic quantity is extracted
from said audio signal, partial tone voice time series characteristic quantity is cut out from it while
shifting the base period window of the length serving as said standard little by little, and partial tone
voice time series characteristic quantity is similarly cut out while shifting little by little the time
windows of two or more kinds of length centering on the length of the base period window.

[Claim 4]The audio signal search method according to claim 1, 2, or 3, wherein the voice time series
feature amount vector accumulated as said retrieval object and said search key voice time series feature
amount vector are obtained by linearly compressing the partial tone voice time series characteristic
quantity of the length of the base period window to a predetermined length.

[Claim 5]The audio signal search method according to any one of claims 1 to 4, wherein, after a search has
been performed once, information specifying an audio signal section outputted as a search result is
inputted as said search key, the voice time series feature amount vector corresponding to the specified
audio signal section is used as said search key voice time series feature amount vector, and re-retrieval
is performed by similar distance calculation with the voice time series feature amount vectors accumulated
as said retrieval object.
[Claim 6]An audio signal accumulating method for voice search comprising: a process in which an audio
signal used as a retrieval object is inputted; a process in which voice time series characteristic
quantity is extracted from the inputted audio signal; a process in which partial tone voice time series
characteristic quantity is cut out from said voice time series characteristic quantity while shifting
little by little a base period window, which is a time window of the length serving as a standard; a
process in which partial tone voice time series characteristic quantity is likewise cut out while shifting
little by little time windows of two or more kinds of length centering on the length of said base period
window; a process in which the partial tone voice time series characteristic quantities of said two or more
kinds of length are linearly expanded or contracted and aligned to the length of the base period window;
and a process in which the partial tone voice time series characteristic quantities aligned to the length
of said base period window are accumulated as voice time series feature amount vectors used as retrieval
objects, for comparison with a voice time series feature amount vector obtained from an audio signal
inputted as a search key.

[Claim 7]The audio signal accumulating method for voice search according to claim 6, wherein the result of
linearly compressing, to a predetermined length, the partial tone voice time series characteristic
quantity aligned to the length of said base period window is accumulated as the voice time series feature
amount vector of the retrieval object.

[Claim 8]An audio signal retrieval device comprising: a means to extract partial tone voice time series
characteristic quantities of different lengths from an audio signal used as a retrieval object, using time
windows of two or more kinds of length; a means to linearly expand or contract the extracted partial tone
voice time series characteristic quantities of two or more kinds of length and align them to the length of
a base period window used as a standard for similar distance calculation at the time of search; a means to
accumulate the partial tone voice time series characteristic quantities aligned to the length of said base
period window as voice time series feature amount vectors used as retrieval objects, for comparison with a
voice time series feature amount vector obtained from an audio signal inputted as a search key; a means to
extract a search key voice time series feature amount vector of the length of said base period window from
an audio signal inputted or specified as a search key; a means to calculate the similar distance between
each voice time series feature amount vector accumulated as said retrieval object and said search key voice
time series feature amount vector, and to compute the similarity between the audio signal of the search key
and the audio signal of each audio signal section which is a retrieval object; and a means to output search
results based on the computed similarity.

[Claim 9]An audio signal retrieval device comprising: a means to extract partial tone voice time series
characteristic quantity from an audio signal used as a retrieval object, using a base period window of the
length used as a standard for similar distance calculation at the time of search; a means to accumulate the
extracted partial tone voice time series characteristic quantity as voice time series feature amount
vectors used as retrieval objects, for comparison with a voice time series feature amount vector obtained
from an audio signal inputted as a search key; a means to extract partial tone voice time series
characteristic quantities of different lengths from an audio signal inputted or specified as a search key,
using time windows of two or more kinds of length; a means to linearly expand or contract the extracted
partial tone voice time series characteristic quantities of two or more kinds of length and align them to
the length of the base period window used as a standard for similar distance calculation at the time of
search; a means to use the partial tone voice time series characteristic quantities aligned to the length
of said base period window as search key voice time series feature amount vectors, to calculate the similar
distance between the voice time series feature amount vectors accumulated as said retrieval object and said
search key voice time series feature amount vectors, and to compute the similarity between the audio signal
of the search key and the audio signal of each audio signal section which is a retrieval object; and a
means to output search results based on the computed similarity.

[Claim 10]A program for audio signal search for causing a computer to execute the audio signal search
method according to any one of claims 1 to 5.

[Claim 11]A recording medium of a program for audio signal search, recording a program for causing a
computer to execute the audio signal search method according to any one of claims 1 to 5.
[Claim 12]A program for audio signal accumulation for causing a computer to execute the audio signal
accumulating method for voice search according to claim 6 or 7.

[Claim 13]A recording medium of a program for audio signal accumulation, recording a program for causing a
computer to execute the audio signal accumulating method for voice search according to claim 6 or 7.



DETAILED DESCRIPTION 
[Detailed Description of the Invention] 
[0001] 

[Field of the Invention]This invention relates to the art of voice search systems, and in particular to an
audio signal search method for searching a voice database for audio signal sections similar to an inputted
or specified audio signal, an audio signal accumulating method for voice search, an audio signal retrieval
device, programs therefor, and recording media of the programs.
[0002] 

[Description of the Prior Art]As conventional technology for searching a voice database for an audio
signal similar to an input voice signal, there is the technique described in the following reference 1.

[Reference 1] Takashi Endo, Masayuki Nakazawa, Hironobu Takahashi, the Japanese Society for Artificial 
Intelligence national conference (the 12th time) in the sound, data representation [ by the self-organization 
network of video ], and ******:"both spotting search": 1998 fiscal year, S5-04, pp.122-125. 
This is the IPM (Incremental Path Method). It calculates the similarity between audio signals by dynamic
matching of the same kind as DP matching, and uses a network to search a voice database for audio signals
similar to the input voice signal. According to this method, because dynamic matching is used, similarity
can be calculated even for audio signals that are nonlinearly expanded or contracted along the time axis.
[0003] 

[Problem(s) to be Solved by the Invention]To search for an audio signal using an audio signal, it is
necessary to calculate the similarity between audio signals. Generally, even when the same character
string is uttered, the audio signal is nonlinearly expanded or contracted on the time axis. For this
reason, dynamic matching such as DP matching or the Hidden Markov Model (HMM) has been needed in order to
calculate the similarity between audio signals while accommodating nonlinear expansion and contraction.

[0004]However, not all audio signals are nonlinearly expanded or contracted on the time axis. The first
example of an audio signal that is not is singing voice. Because singing voice inherently has a tempo, the
tempo of a whole section may become slower or faster, but within a short-time section the tempo is not
disturbed. That is, within a short-time section, singing voice may be linearly expanded or contracted over
the whole section, but it is not nonlinearly expanded or contracted within the section.

[0005]Another example of an audio signal that is not nonlinearly expanded or contracted is an announcer's
speech. Since announcers are very good at uttering the same words in the same manner repeatedly, the same
words are not nonlinearly expanded or contracted. Even for ordinary speakers, a short utterance of about
one word may be linearly expanded or contracted over the whole section, but within the section it can be
regarded as having hardly any nonlinear expansion or contraction.
[0006]In the prior art, dynamic matching was used when calculating the similarity between audio signals.
Consequently, dynamic matching, which can handle nonlinear expansion and contraction, was performed even on
sounds that have only been linearly expanded or contracted. Dynamic matching has much greater computational
complexity than static matching, which calculates a distance such as the Euclidean distance or the
Manhattan distance between fixed-length vectors, so there was a problem that search time becomes long.
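The difference in computational complexity described above can be illustrated with a minimal Python sketch. The function names are hypothetical and the sequences are scalar-valued for brevity; `dp_matching` is a textbook dynamic-programming alignment and is not taken from reference 1. Static matching makes a single pass over two equal-length vectors, whereas dynamic matching fills a table with one cell per pair of frames.

```python
import math


def euclidean(a, b):
    """Static matching: one pass over two fixed-length vectors."""
    if len(a) != len(b):
        raise ValueError("static matching needs vectors of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Static matching with the Manhattan (city-block) distance."""
    if len(a) != len(b):
        raise ValueError("static matching needs vectors of equal length")
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_matching(a, b):
    """Dynamic matching: fills a (len(a) + 1) x (len(b) + 1) cost table."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

For two sequences of n frames each, the static distances cost on the order of n operations, while the table in `dp_matching` costs on the order of n squared.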
[0007] 

[Means for Solving the Problem]To solve said technical problem, this invention makes it possible, in the
search of voice data, to compare the characteristic quantities cut out using time windows at high speed by
static matching rather than dynamic matching. To this end, the voice time series characteristic quantity
of the voice data is linearly expanded or contracted and aligned into fixed-length data of a certain fixed
time length. This makes it possible to search for voice data similar to a search key at high speed by
simple matching between fixed-length data.

[0008]Specifically, an audio signal retrieval device of this invention consists of a voice time series feature 
amount extracting means, a time window logging means, a partial tone voice time series characteristic 
quantity fixed length-ized means, a search key voice time series feature amount extracting means, a voice 
storage means, a search condition input means, a characteristic quantity information comparison means, 
and a display style creating means. A voice time series characteristic quantity linearity compression means 
and a search key voice time series characteristic quantity linearity compression means may be established. 
[0009]A voice time series feature amount extracting means extracts voice time series characteristic 
quantity from an audio signal which is a time series signal. 

[0010]The time window logging means first prepares a base period window, which is a time window of the
length serving as a standard. It then cuts out partial tone voice time series characteristic quantity of
the length of the base period window from the voice time series characteristic quantity extracted by the
voice time series feature amount extracting means, while shifting the base period window little by little.
Next, time windows of two or more kinds of length are prepared, centered on the length of the base period
window, and partial tone voice time series characteristic quantity is cut out in the same way by the time
windows of those lengths.
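The cutting-out by shifted time windows of several lengths can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `cut_windows`, the `shift` parameter, and the record layout `(start, length, segment)` are hypothetical, since the text only says that the windows are shifted "little by little".

```python
def cut_windows(features, window_lengths, shift):
    """Cut partial sequences out of `features` with sliding windows.

    features       -- list of per-frame feature values, in time order
    window_lengths -- window lengths in frames (the base length plus others)
    shift          -- number of frames the window advances each step (assumed)

    Returns a list of (start_frame, length, segment) records. The first two
    fields identify the section of the source signal each segment came from.
    """
    if shift <= 0:
        raise ValueError("shift must be a positive number of frames")
    records = []
    for length in window_lengths:
        # The last valid start leaves exactly `length` frames to the end.
        for start in range(0, len(features) - length + 1, shift):
            records.append((start, length, features[start:start + length]))
    return records
```

Keeping `start` and `length` with each segment is one way to retain the information needed later to match a feature vector back to the section it was extracted from.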

[0011]The partial tone voice time series characteristic quantity fixed length-ized means linearly expands
or contracts the partial tone voice time series characteristic quantities cut out by the time window
logging means with the time windows of two or more kinds of length, generates partial tone voice time
series characteristic quantities aligned to the base period window length, and extracts them as voice time
series feature amount vectors.
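One plausible reading of the linear expansion or contraction is linear interpolation along the time axis. The following Python sketch assumes that reading; the names `linear_resample` and `to_vector` are hypothetical, and each frame is taken to be a list of numbers.

```python
def linear_resample(frames, target_len):
    """Linearly expand or contract `frames` to exactly `target_len` frames.

    frames -- list of frames, each frame a list of feature values
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and one output frame")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(target_len)]
    scale = (n - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale              # fractional position in the input
        i = min(int(pos), n - 2)     # left neighbour, kept inside the input
        t = pos - i                  # interpolation weight in [0, 1]
        out.append([(1 - t) * x + t * y for x, y in zip(frames[i], frames[i + 1])])
    return out


def to_vector(frames):
    """Concatenate frames in time order into one fixed-length vector."""
    return [value for frame in frames for value in frame]
```

After this step every segment, whatever its original window length, yields a vector with the same number of elements, which is the precondition for static matching.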

[0012]The voice time series characteristic quantity linearity compression means linearly compresses the
voice time series characteristic quantity of base period window length generated by the partial tone voice
time series characteristic quantity fixed length-ized means to a certain fixed length, and extracts a
voice time series feature amount vector.

[0013]The search key voice time series feature amount extracting means extracts search key voice time
series characteristic quantity of base period window length from an audio signal of the length of the base
period window inputted as a search key, and outputs it as a search key voice time series feature amount
vector.

[0014]The search key voice time series characteristic quantity linearity compression means linearly
compresses the voice time series characteristic quantity of base period window length extracted by the
search key voice time series feature amount extracting means to the same length as the voice time series
feature amount vector, and extracts a search key voice time series feature amount vector.
[0015]The voice storage means accumulates the inputted audio signal of the retrieval object, creates an
index from the voice time series feature amount vectors extracted by the voice time series characteristic
quantity linearity compression means, and accumulates the index together with the voice time series
feature amount vectors. It also accumulates time window logging information, which associates each voice
time series feature amount vector with the audio signal section from which it was extracted.
[0016]When an audio signal section accumulated in the voice storage means is used as a search key, the
search condition input means inputs the conditions for specifying the search key.

[0017]The characteristic quantity information comparison means calculates, by static matching, the similar
distance between the search key voice time series feature amount vector extracted by the search key voice
time series characteristic quantity linearity compression means and the voice time series feature amount
vectors accumulated in the voice storage means, and determines the similarity. This makes similarity
calculation faster than dynamic matching possible. It then outputs the top-ranked voice time series
feature amount vectors, that is, those with high similarity to the search key voice time series feature
amount vector.

[0018]The display style creating means associates the ranked voice time series feature amount vectors
outputted from the characteristic quantity information comparison means with the audio signal sections
from which they were extracted, using the time window logging information accumulated in the voice storage
means, and outputs them to a display.
[0019] 

[Embodiment of the Invention][Embodiment 1] Drawing 1 is a block diagram for describing Embodiment 1 of
the invention. In Embodiment 1, sound inputted from the outside is used as a search key.

[0020]Operation of this invention comprises the voice storage phase P1, the voice time series feature 
amount vector extraction phase P2 called from it, the voice search phase P3, and the search key voice time 
series feature amount vector extraction phase P4 called from it. Hereafter, operation of each phase is 
explained. 

[0021][A]The voice storage phase P1 and voice time series feature amount vector extraction phase P2
Drawing 2 is a flow chart explaining the operation of the voice storage phase P1 and the voice time series
feature amount vector extraction phase P2.

[0022]First, the audio signal of a retrieval object is inputted from the retrieval object sound signal
input device 20, and the audio signal is accumulated in the memory storage 15 in the voice storage part 14
(Step S1).

[0023]Next, in the voice time series feature quantity extracting part 11, voice time series characteristic
quantity is extracted from the inputted audio signal (Step S2). As voice time series characteristic
quantity, for example, the low-order terms of the Mel-frequency cepstrum coefficients, their first-order
and second-order differences, the speech power, the speech power of each band obtained by filter bank
analysis, and the like can be expressed as a multidimensional vector, and these vectors arranged in time
series order can be used. Examples of voice time series characteristic quantity are described in the
following reference 2.

[Reference 2] Kiyohiro Kano et al.: "IT text voice recognition system", Ohm-Sha, 2001.
Next, in the time window logging part 12, as shown in drawing 3, a time window of the length serving as a
standard is first set up as the base period window, and partial tone voice time series characteristic
quantity of base period window length is cut out while shifting this base period window little by little
(Step S3).
[0024]As shown in drawing 4, time windows of two or more kinds of length centering on the length of the
base period window are set up, and partial tone voice time series characteristic quantity of each time
window length is cut out, as with the base period window, while shifting the time window little by little
(Step S4). In the experimental example described later, the base period window has 150 frames, at about 26
milliseconds per frame. The minimum time window length was 118 frames and the maximum was 182 frames. When
partial tone voice time series characteristic quantity is cut out using these time windows, information
about the cutting-out by each time window is accumulated in the memory storage 15 in the voice storage
part 14 as time window logging information.
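As a worked example of what these figures imply, the following Python sketch computes the factor by which a segment of each window length must be stretched or shrunk to reach the 150-frame base length. The values 118, 150 and 182 come from the text; the spacing of 8 frames between window lengths is purely an assumption for illustration, as the text does not state how many window lengths were used.

```python
BASE_FRAMES = 150  # base period window length in the experimental example
MIN_FRAMES = 118   # shortest time window
MAX_FRAMES = 182   # longest time window


def window_lengths(step):
    """Window lengths from MIN_FRAMES to MAX_FRAMES; `step` is hypothetical."""
    return list(range(MIN_FRAMES, MAX_FRAMES + 1, step))


def stretch_factor(length, base=BASE_FRAMES):
    """Factor applied along the time axis to bring `length` frames to `base`."""
    return base / length


lengths = window_lengths(step=8)
factors = {n: round(stretch_factor(n), 3) for n in lengths}
# The shortest window is stretched by about 1.271 and the longest is
# shrunk to about 0.824 of its length; the base window is left unchanged.
```

In other words, the stated range tolerates a section that is roughly 21 percent shorter or longer than the search key.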
[0025]In the partial tone voice time series characteristic quantity fixed length-ized part 13, as shown in
drawing 5, the partial tone voice time series characteristic quantities cut out by the time windows of two
or more kinds of length are each linearly expanded or contracted in the time axis direction and aligned to
the length of the base period window (Step S5), and the resulting partial tone voice time series
characteristic quantities of base period window length are used as voice time series feature amount
vectors (Step S6).

[0026]Then an index is built from the obtained voice time series feature amount vectors and accumulated,
together with the voice time series feature amount vectors, in the memory storage 15 of the voice storage
part 14 (Step S7). As an index structure for multidimensional vectors, the SR-tree described in the
following reference 3, the A-tree described in reference 4, and the like can be used, for example.

[Reference 3] Norio Katayama and Shin'ichi Satoh: "The SR-tree: An Index Structure for High-Dimensional
Nearest Neighbor Queries", In Proc. ACM SIGMOD International Conference on Management of Data,
pp.368-380, May 1997.

[Reference 4] Yasushi Sakurai, Masatoshi Yoshikawa, Shunsuke Uemura, and Haruhiko Kojima: "The A-tree: An
Index Structure for High-Dimensional Spaces Using Relative Approximation", In Proc. of the 26th
International Conference on Very Large Data Bases (VLDB), pp.516-526, Cairo, September 2000.
[B]The voice search phase P3 and search key voice time series feature amount vector extraction phase P4
Drawing 6 is a flow chart explaining the operation of the voice search phase P3 and the search key voice
time series feature amount vector extraction phase P4.

[0027]First, the audio signal of the base period window length which becomes a search key is inputted using 
the search key sound signal input device 22 (Step S10). 

[0028]Next, in the search key voice time series feature quantity extracting part 18, the search key voice
time series characteristic quantity of base period window length is extracted from the inputted audio
signal of base period window length (Step S11). The same kind of characteristic quantity as the voice time
series characteristic quantity of the voice storage phase P1 described above is used as the search key
voice time series characteristic quantity. This search key voice time series characteristic quantity is
used as the search key voice time series feature amount vector (Step S12).

[0029]In the characteristic quantity information comparing element 16, the similar distance between the
obtained search key voice time series feature amount vector and the voice time series feature amount
vectors accumulated in the memory storage 15 in the voice storage part 14 is calculated (Step S13). This
distance calculation uses the index accumulated in the memory storage 15 of the voice storage part 14 and
is performed by static matching between fixed-length vectors, such as the Euclidean distance or the
Manhattan distance. This makes distance calculation faster than dynamic matching possible. The audio
signal sections of the voice time series feature amount vectors are then ranked in ascending order of
distance (Step S14).
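Steps S13 and S14 can be sketched in Python as follows. For brevity this sketch scans every stored vector instead of using a multidimensional index such as the SR-tree or A-tree, so it shows only the distance calculation and the ranking, not the indexed lookup. The function name `rank_sections`, the `top_k` parameter, and the `(section, vector)` record layout are hypothetical.

```python
import math


def rank_sections(key_vector, stored, top_k=5):
    """Rank stored sections by Euclidean distance to the search key.

    key_vector -- fixed-length search key feature vector
    stored     -- list of (section, vector) pairs, where section is a
                  (start_frame, length) tuple identifying the source section
    top_k      -- number of best matches to return

    Returns a list of (distance, section) pairs, nearest first.
    """
    scored = []
    for section, vector in stored:
        if len(vector) != len(key_vector):
            raise ValueError("all vectors must share the key's fixed length")
        scored.append((math.dist(key_vector, vector), section))
    scored.sort(key=lambda item: item[0])  # shortest distance ranks first
    return scored[:top_k]
```

Because every stored vector has the same fixed length as the key, each comparison is a single pass over the vector elements.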

[0030]Finally, in the display style generation part 17, each voice time series feature amount vector is
associated with the audio signal section from which it was extracted, using the time window logging
information accumulated in the memory storage 15 of the voice storage part 14, and the result is outputted
to the display 21 (Step S15). For example, when displaying the search results for the portions of a
one-hour music program that correspond to a search key, the display shows a list, in descending order of
rank, indicating how many seconds from the head of the program each audio signal section is located,
together with a button for reproducing that portion; when the reproduction button is pushed, the sound of
that portion is outputted. This saves the searcher the time and effort of finding the audio signal
sections that match the retrieval target. When the database holds information about the retrieval object,
such as track names, the track names of the search results and the like can be displayed together.
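The list display described above might be produced along the following lines. This Python sketch is illustrative only: the function name `format_results`, the output wording, and the `frame_shift_sec` parameter (the time by which consecutive frames advance) are assumptions, since the text does not give the frame shift or the display layout.

```python
def format_results(ranked, frame_shift_sec):
    """Turn ranked (distance, (start_frame, length)) pairs into list lines.

    frame_shift_sec -- assumed time in seconds between consecutive frames
    """
    lines = []
    for rank, (distance, (start, length)) in enumerate(ranked, start=1):
        begin = start * frame_shift_sec            # seconds from the head
        end = (start + length) * frame_shift_sec   # end of the section
        lines.append(
            f"{rank}. {begin:.2f} s to {end:.2f} s from the head "
            f"(distance {distance:.3f})"
        )
    return lines
```

Each line carries the start and end of a section, which is the information a reproduction button would need in order to play that portion.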

[0031][Embodiment 2] Drawing 7 is a block diagram for describing Embodiment 2 of the invention.
Embodiment 2 compresses the characteristic quantities cut out using the time windows to reduce their size,
making it possible to search with a smaller database.

[0032]In Embodiment 2, the result of linearly compressing on the time axis the partial tone voice time
series characteristic quantity that has been linearly expanded or contracted and aligned to the length of
the base period window is used as the voice time series feature amount vector. Correspondingly, the result
of linearly compressing on the time axis the search key voice time series characteristic quantity of the
length of the base period window, to the same length as the voice time series feature amount vector, is
used as the search key voice time series feature amount vector. Therefore, compared with Embodiment 1, the
voice time series feature amount vector extraction phase P2 called from the voice storage phase P1 and the
search key voice time series feature amount vector extraction phase P4 called from the voice search phase
P3 differ, and are as follows.

[0033][A]The voice storage phase P1 and voice time series feature amount vector extraction phase P2
Drawing 8 is a flow chart explaining the operation of the voice storage phase P1 and the voice time series
feature amount vector extraction phase P2.

[0034]As in Embodiment 1, an audio signal is inputted from the retrieval object sound signal input device
20 (Step S20). In the voice time series feature quantity extracting part 11, voice time series
characteristic quantity is extracted from the inputted audio signal (Step S21), and in the time window
logging part 12, partial tone voice time series characteristic quantity of base period window length is
cut out while shifting the base period window little by little (Step S22). Partial tone voice time series
characteristic quantity is also cut out by the time windows of two or more kinds of length (Step S23), and
in the partial tone voice time series characteristic quantity fixed length-ized part 13, the cut-out
partial tone voice time series characteristic quantities of two or more kinds of length are linearly
expanded or contracted on the time axis to generate partial tone voice time series characteristic
quantities aligned to the base period window length (Step S24).
[0035]Next, in the voice time series characteristic quantity linearity compression zone 30, as shown in
drawing 9, the partial tone voice time series characteristic quantities aligned to the length of the base
period window are each linearly compressed in the time axis direction, and voice time series feature
amount vectors are extracted (Step S25). This reduces the number of dimensions of the voice time series
feature amount vector, which reduces the computational complexity of similarity calculation and also the
capacity of the memory storage 15 needed for accumulation.
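The effect of compression on the number of dimensions can be shown with a small Python sketch. Drawing 9 is not reproduced here, so the compression scheme below, which averages consecutive groups of frames, is only one assumed way to compress linearly along the time axis; the function name `compress_frames` is likewise hypothetical.

```python
def compress_frames(frames, target_len):
    """Compress `frames` along the time axis to `target_len` frames.

    Each output frame is the element-wise mean of one contiguous group of
    input frames; the groups partition the input as evenly as possible.
    """
    n = len(frames)
    if not 0 < target_len <= n:
        raise ValueError("target_len must be between 1 and len(frames)")
    out = []
    for k in range(target_len):
        lo = k * n // target_len
        hi = (k + 1) * n // target_len   # always greater than lo here
        group = frames[lo:hi]
        out.append([sum(column) / len(group) for column in zip(*group)])
    return out
```

A segment of N frames with D values per frame gives a vector of N x D elements; compressing to M frames reduces this to M x D, so both the distance calculation and the storage shrink by a factor of N / M.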
[0036]As in Embodiment 1, an index is built from the obtained voice time series feature amount vectors and
accumulated, together with the voice time series feature amount vectors, in the memory storage 15 of the
voice storage part 14.

[0037][B] The speech retrieval phase P3 and the search key speech time-series feature quantity vector extraction phase P4. Drawing 10 is a flow chart explaining the operation of the speech retrieval phase P3 and the search key speech time-series feature quantity vector extraction phase P4.

[0038]As in Embodiment 1, an audio signal of the reference time window length that serves as a search key is input using the search key audio signal input device 22 (Step S30), and the search key speech time-series feature quantity extraction part 18 extracts a search key speech time-series feature quantity of the reference time window length from the input audio signal (Step S31).

[0039]Then, as shown in Drawing 9, the search key speech time-series feature quantity linear compression part 31 linearly compresses the speech time-series feature quantity of the reference time window length in the time-axis direction, and extracts a search key speech time-series feature quantity vector (Step S32). The length of the search key speech time-series feature quantity vector is made the same as that of the speech time-series feature quantity vectors generated in the speech storage phase P1.
[0040]As in Embodiment 1, the feature quantity information comparison part 16 calculates the similarity distance between the obtained search key speech time-series feature quantity vector and the speech time-series feature quantity vectors stored in the storage device 15 of the speech storage part 14 (Step S33), and ranks the speech time-series feature quantity vectors in ascending order of distance (Step S34).
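Because the key and the stored vectors have the same fixed length, Steps S33 and S34 reduce to a simple vector comparison. The sketch below assumes Euclidean distance and a brute-force scan; this passage names neither the distance measure nor how the index is used, so both are illustrative assumptions.

```python
import numpy as np

def rank_by_distance(key_vec: np.ndarray, stored_vecs: np.ndarray, top_k: int = 20):
    """Rank stored vectors by distance to the key, nearest first.

    stored_vecs has shape (n_vectors, vector_len); key_vec has shape (vector_len,).
    Returns a list of (stored_index, distance) pairs.
    """
    # Step S33: one distance per stored vector (Euclidean distance assumed).
    dists = np.linalg.norm(stored_vecs - key_vec, axis=1)
    # Step S34: order by increasing distance, i.e. decreasing similarity.
    order = np.argsort(dists, kind="stable")[:top_k]
    return [(int(i), float(dists[i])) for i in order]
```

No time alignment is performed at query time, which is the difference from dynamic (nonlinear) matching that the specification relies on for speed.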

[0041]Finally, as in Embodiment 1, the display format generation part 17 associates each speech time-series feature quantity vector with the audio signal section from which it was extracted, using the time window segmentation information stored in the storage device 15 of the speech storage part 14, and outputs the result to the display device 21 (Step S35). For example, when displaying the portions of a one-hour music program that correspond to a search key, list information showing each audio signal section and how many seconds it lies from the beginning of the program is displayed in descending order of rank, together with a playback button for each portion; when a playback button is pressed, the sound of that portion is output.

[0042][Embodiment 3] Drawing 11 and Drawing 12 are block diagrams for describing Embodiment 3 of the invention. In Embodiment 3, instead of inputting a search key from outside, once a search has been performed and search results obtained, a search key is newly specified from among the search results so that voice data similar to it can be searched for.

[0043]Two or more search results obtained with voice data as the search key are displayed on a screen in order of similarity, as in a table, and other similar voice data is searched for by using voice data obtained as a search result as a new search key.

[0044]For example, in music search, touching the "track name" portion of the search-results display screen selects that musical piece as the search result, and audio is played back from the beginning of the portion corresponding to the search key, or from the beginning of the piece. Touching the "ranking" portion of the screen performs a further search for similar musical pieces, using that piece's search result (for example, the portion of the piece, 3 to 4 seconds long, that is similar to the original search key) as a new search key.
[0045]As described above, in Embodiment 3 the result closest to the retrieval target is chosen from the search results as a search key, and the search is performed again. The speech storage phase P1 of the audio signal retrieval device 10 shown in Drawing 11 is the same as the speech storage phase P1 of Embodiment 1 shown in Drawing 1, and that of the audio signal retrieval device 10 shown in Drawing 12 is the same as the speech storage phase P1 of Embodiment 2 shown in Drawing 7. The speech retrieval phase P3 differs from that of Embodiments 1 and 2 and is as follows.

[0046][A] Speech retrieval phase P3. Drawing 13 is a flow chart explaining the operation of the speech retrieval phase P3.

[0047]As a preliminary stage, a search is performed as in Embodiment 1 or Embodiment 2, the search results are ranked in order of similarity, and they are displayed on the display device 21 (Step S40). If, according to an instruction from the user, the displayed search results sufficiently match the retrieval target, the search ends (Step S41).

[0048]If they do not match the retrieval target, the user is prompted, in the search condition input section 23, to specify as a search key the result closest to the retrieval target among the ranked search results already displayed, and that result is reselected as the search key (Step S42). For this purpose, when the ranked results are displayed on the display device 21 in Step S40, two input buttons are displayed per result: one is pressed to specify the result as a search key, and the other is pressed to play back the sound of the result. The user can specify a search key from among the search results by pressing the former button on the display device 21.

[0049]Once a search key is specified, the feature quantity information comparison part 16 searches the stored data for voice sections with high similarity to the search key and ranks them in descending order of similarity (Step S43). The display format generation part 17 generates a display format in which a search key can be chosen from the search results, and displays it on the display device 21 (Step S40).
[0050]If the displayed search results sufficiently match the retrieval target, the search ends (Step S41). If they still do not, a search key is reselected and the search is performed again; this is repeated until results that sufficiently match the retrieval target are obtained (Steps S40-S43).
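The loop of Steps S40 to S43 can be sketched as below. The user's decision is modelled by a hypothetical `choose` callback that returns `None` when the displayed ranking is satisfactory, or the index of the stored result to reuse as the new search key; the Euclidean ranking and the round limit are likewise assumptions for illustration.

```python
import numpy as np

def rank(key_vec: np.ndarray, stored_vecs: np.ndarray) -> list:
    """Indices of stored vectors, nearest to the key first (Euclidean, assumed)."""
    dists = np.linalg.norm(stored_vecs - key_vec, axis=1)
    return [int(i) for i in np.argsort(dists, kind="stable")]

def search_with_requery(key_vec, stored_vecs, choose, max_rounds=10):
    """Repeat the search, each time using a chosen result as the new key."""
    ranking = rank(key_vec, stored_vecs)          # Step S40: initial ranked results
    for _ in range(max_rounds):
        picked = choose(ranking)                  # Steps S41/S42: user decides
        if picked is None:                        # results match the target: stop
            break
        # Step S43: search again with the picked stored vector as the key.
        ranking = rank(stored_vecs[picked], stored_vecs)
    return ranking
```

Because the new key is already a stored fixed-length vector, a re-query needs no further feature extraction.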

[0051][Embodiment 4] In Embodiments 1-3 above, partial speech time-series feature quantities of different lengths are extracted from the retrieval target audio signal using time windows of two or more lengths, and these partial speech time-series feature quantities are linearly expanded or contracted. The results, aligned to the reference time window length, are stored in the speech storage part 14 as the speech time-series feature quantity vectors of the retrieval target.
[0052]In Embodiment 4, the linear expansion or contraction of the partial speech time-series feature quantities cut out by the time windows is performed not on the retrieval target but on what is input as the search key. That is, in Embodiment 4, partial speech time-series feature quantities are not extracted from the retrieval target audio signal using time windows of two or more lengths; they are extracted using only the reference time window and stored in the speech storage part 14 as the speech time-series feature quantity vectors of the retrieval target. For the audio signal input as a search key, on the other hand, partial speech time-series feature quantities of different lengths are extracted using time windows of two or more lengths, and these are linearly expanded or contracted so that they have the reference time window length.

[0053]Thus, even when the linear expansion or contraction is applied to what is input as the search key, the same search results as in Embodiments 1-3 are obtained. After alignment to the reference time window length, the amount of fixed-length data to be compared can also be reduced, if necessary, by applying linear compression at search time as in Embodiment 2.

[0054][Experimental result] To check the validity of this invention, two kinds of experiments on Embodiment 2 were conducted using singing data created for the experiments.

[0055]First, the 1st experiment confirms that, in searching singing voice, matching that accommodates linear expansion and contraction according to this invention is more effective than matching that does not. Then the 2nd experiment confirms that this invention is sufficiently effective in searching singing voice even when compared with a conventional system that implements nonlinear elastic matching.

[0056][A] Experimental conditions common to both experiments. The 15 test subjects are divided into three groups (women, men, and mixed), and a list of 62 song titles is given to each group. Subjects who know a song in the list are asked to sing part of a phrase (about 10 seconds), and 62x3 singing voices (about 30 minutes in total) are stored in a database.

[0057]From the singing voices of one test subject group, 12 songs are selected and one phrase (reference time window length: 150 frames, about 4 seconds) is arbitrarily taken from each and used as a search key. The same phrase portions in the singing voices of the other two groups' test subjects are then searched for as relevant results.

[0058]In this experiment, singing voice is searched from voice data in wave file format with a sampling frequency of 44100 Hz, 16 quantization bits, and one channel. As the speech feature quantity, the lower 5 dimensions of the mel-frequency cepstrum coefficients are used, and time-series feature quantities are extracted using time windows from these coefficients arranged along the time axis.

[0059]The average of the average search length is used as the evaluation criterion for search results. This criterion represents the effort needed to find results that match the retrieval target among the search results. Relevance is judged up to the 20th place of the ranked search results; even if relevant results are ranked below the 20th place, they are regarded as not having been retrieved.

[0060][B] Explanation of the average of the average search length. Here, the average of the average search length used as the evaluation criterion for search results is explained. Average search length is described in Reference 5 below.

[Reference 5] Tokunaga, "Language and Computation 5: Information Retrieval and Language Processing", University of Tokyo Press, 1999.

Average search length is a measure for evaluating ranked sets returned as search results. When ranked results are returned, the searcher actually has to judge the relevance of the results one by one, starting from the top. Taking this judgment process into account, average search length measures the effort the user must spend judging relevance in order to obtain the required number of relevant results.

[0061]For example, suppose that the search results can be divided into ordered sets S1, S2, and S3 as follows, where the sets are ranked in the order S1, S2, S3, and O and x denote a relevant result and an irrelevant result, respectively.
[0062]S1: {O, x, x, x}
S2: {O, O, O, O, x, x}
S3: {O, O, x, x}

Now suppose that the searcher wants to obtain one relevant result. Set S1 is inspected first. Since no order is assigned within this set, the expected number of results that must be inspected before a relevant result is found is 1x1/4 + 2x1/4 + 3x1/4 + 4x1/4 = 2.5.
[0063]This means that, to find one relevant result among the search results, the searcher must judge the relevance of 2.5 search results on average. That is, the average search length required to find one relevant result in these search results is 2.5.

[0064]To find two relevant results, all of set S1 is inspected, after which only one more has to be found in set S2, so the expected number of results to be inspected is (4+1)x4/6 + (4+2)x4/15 + (4+3)x1/15 = 5.4. That is, the average search length required to find two relevant results is 5.4.
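The two expected values above can be reproduced with a short calculation. The sketch assumes that results inside one set are inspected in uniformly random order and represents each set as a pair (number of relevant results, number of irrelevant results); the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def expected_position(n_rel: int, n_irr: int, k: int) -> Fraction:
    """Expected number of inspections to find the k-th relevant result in one
    unordered set with n_rel relevant and n_irr irrelevant results."""
    n = n_rel + n_irr
    total = comb(n, n_rel)  # equally likely placements of the relevant results
    # The k-th relevant result sits at position p when k-1 relevant results
    # precede it and the remaining n_rel-k follow it.
    return sum(
        Fraction(p * comb(p - 1, k - 1) * comb(n - p, n_rel - k), total)
        for p in range(k, n - (n_rel - k) + 1)
    )

def expected_search_length(ordered_sets, need: int) -> Fraction:
    """Average search length for `need` relevant results over ranked sets."""
    inspected = 0
    for n_rel, n_irr in ordered_sets:
        if need <= n_rel:
            return inspected + expected_position(n_rel, n_irr, need)
        inspected += n_rel + n_irr  # the whole set must be inspected
        need -= n_rel
    raise ValueError("not enough relevant results in the search results")

sets = [(1, 3), (4, 2), (2, 2)]            # S1, S2, S3 from the example
print(expected_search_length(sets, 1))     # 5/2
print(expected_search_length(sets, 2))     # 27/5
```

Exact fractions are used so that 5/2 and 27/5 match the 2.5 and 5.4 of the text without rounding.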

[0065]The above example is a case where the search results are given as ranked sets, but even when a total order is assigned to the individual search results, the average search length can be calculated by regarding each set as having a single element.

[0066]As the above also shows, average search length is not a single measure but a value that depends on the number of relevant results required. The average search length per required relevant result is therefore calculated as the average of the average search length.

[0067]Let i be the number of relevant results required, M the total number of relevant results, and x(i) the average search length required to find i relevant results. The average x_av of the average search length is then expressed as

x_av = (1/M) Σ_{i=1}^{M} {x(i)/i}.
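For a totally ordered result list, x(i) is simply the rank at which the i-th relevant result appears, so the average can be computed directly from the ranks of the relevant results. The following is a minimal sketch under that assumption; the function name is illustrative.

```python
def average_of_average_search_length(relevant_ranks):
    """x_av = (1/M) * sum over i = 1..M of x(i) / i.

    relevant_ranks: 1-based ranks of the relevant results in a totally
    ordered result list, so x(i) is the rank of the i-th relevant result.
    """
    ranks = sorted(relevant_ranks)
    m = len(ranks)
    if m == 0:
        raise ValueError("no relevant results were retrieved")
    return sum(x_i / i for i, x_i in enumerate(ranks, start=1)) / m
```

A perfect ranking, with the M relevant results in the top M places, gives the minimum value of 1, and larger values mean more inspection effort per relevant result.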

[0068]For example, consider a case where the search returns relevant results in the 2nd place and the 6th place. When the number of relevant results required is 1, the average search length is 2, and when the number of relevant results required is 2, the average search length is 6. The average of these average search lengths is (2/1 + 6/2)/2 = 2.5.
[0069][C] The 1st experiment. As a method that does not accommodate linear expansion and contraction, the experiment uses a method that extracts time-series feature quantities using only a time window of one length. As a method that accommodates linear expansion and contraction, it uses the method of this invention with nine kinds of time windows. The averages of the average search length of the two methods are compared.

[0070][D] Results of the 1st experiment. The comparison of the averages of the average search length in the 1st experiment is shown in Drawing 14. An x in the figure means that no relevant result could be retrieved. Although the average of the average search length is the same for songs B, E, and H, the method of this invention is better for all the other songs. That is, when the conventional case, in which a fixed-length time window is applied to the retrieval target voice data without linear expansion or contraction, is compared with the case using the linear expansion and contraction of this invention, the average search length turns out to be shorter with this invention, and retrieval precision improves by between 25% and about 2 times. The method of this invention, which accommodates linear expansion and contraction, can therefore be said to be effective in searching singing voice.

[0071][E] The 2nd experiment. As a conventional method that can accommodate nonlinear expansion and contraction, the voice search function of "CrossMediator for Video V2.0 (R1)" by Media Drive Corporation is used, and the average of the average search length is compared between the conventional method and the method of this invention. Whether simple matching can reduce the search time is also checked.

[0072]Search time is measured manually as the time from pressing the search start button on the display until the search results are displayed; this is done 10 times and the average value is taken as the search time. The personal computer used for this experiment has a Pentium 4 (1.7 GHz) CPU from Intel of the U.S. and a main memory capacity of 654,812 KB.

[0073][F] Results of the 2nd experiment. The comparison of the averages of the average search length in the 2nd experiment is shown in Drawing 15. An x in the figure means that no relevant result could be retrieved. Drawing 15 shows that, although the method of this invention is slightly worse for songs C and G, overall it gives results that are equivalent or better. Thus the method of this invention, which does not use nonlinear matching, can retrieve as well as or better than the conventional method.

[0074]Next, using song B in Drawing 15 as a search key, the search was repeated and the average was taken as the search time for comparison. What took 4.29 seconds with the conventional method became about 2.42 seconds faster with the method of this invention, confirming that simple matching shortens the search time. That is, when the conventional method using nonlinear expansion and contraction was compared with the method of this invention using linear expansion and contraction, retrieval precision was equivalent, while processing speed (CPU load) improved by about 56% with the method of this invention.
[0075]It is clear that the above validity of this invention basically holds for Embodiments 1 and 4 as well.

[0076]The processing of each embodiment described above can be realized by a computer and a software program. The program can be stored in a suitable computer-readable recording medium, such as a portable memory medium, semiconductor memory, or hard disk, and read from there and executed by a computer. The program can also be downloaded from another computer via a communication line, installed, and executed.

JP 2003-248494 

[0077] 

[Effect of the Invention] Conventional methods that use the audio signal of a search key to find highly similar sections in a retrieval target audio signal calculate the distance representing similarity with dynamic matching, which can accommodate nonlinear expansion and contraction of the audio signal. This invention instead uses static matching that accommodates only linear expansion and contraction for the distance calculation. When the expansion and contraction of the audio signal is mainly linear and there is little nonlinear expansion or contraction, this reduces the computational complexity compared with using dynamic matching for the distance calculation, and has the effect of shortening the search time.

[0078]In this invention, an audio signal section obtained as a search result can be used as a search key. Even if the first search does not find all the audio signal sections in the retrieval target audio signal that match the retrieval target, a narrowed-down search can be performed by reselecting, from the audio signal sections in the search results, one that matches the retrieval target as the search key. This has the effect of raising the possibility that the audio signals matching the retrieval target can be obtained.
DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] It is a block diagram for describing Embodiment 1 of the invention.
[Drawing 2] It is a flow chart explaining the operation of the speech storage phase and the speech time-series feature quantity vector extraction phase.
[Drawing 3] It is a figure explaining the processing of the time window segmentation part.
[Drawing 4] It is a figure explaining an example of how time windows are cut out.
[Drawing 5] It is a figure explaining the method of linearly expanding or contracting partial speech time-series feature quantities.
[Drawing 6] It is a flow chart explaining the operation of the speech retrieval phase and the search key speech time-series feature quantity vector extraction phase.
[Drawing 7] It is a block diagram for describing Embodiment 2 of the invention.
[Drawing 8] It is a flow chart explaining the operation of the speech storage phase and the speech time-series feature quantity vector extraction phase.
[Drawing 9] It is a figure explaining how the partial speech time-series feature quantity of the reference time window length is linearly compressed in the time-axis direction.
[Drawing 10] It is a flow chart explaining the operation of the speech retrieval phase and the search key speech time-series feature quantity vector extraction phase.
[Drawing 11] It is a block diagram for describing Embodiment 3 of the invention.
[Drawing 12] It is a block diagram for describing Embodiment 3 of the invention.
[Drawing 13] It is a flow chart explaining the operation of the speech retrieval phase.
[Drawing 14] It is a figure showing the result of the 1st experiment.
[Drawing 15] It is a figure showing the result of the 2nd experiment.
[Description of Notations]
P1 Speech storage phase
P2 Speech time-series feature quantity vector extraction phase
P3 Speech retrieval phase
P4 Search key speech time-series feature quantity vector extraction phase
10 Audio signal retrieval device
11 Speech time-series feature quantity extraction part
12 Time window segmentation part
13 Partial speech time-series feature quantity length fixation part
14 Speech storage part
15 Storage device
16 Feature quantity information comparison part
17 Display format generation part
18 Search key speech time-series feature quantity extraction part
20 Retrieval target audio signal input device
21 Display device
22 Search key audio signal input device
23 Search condition input device
30 Speech time-series feature quantity linear compression part
31 Search key speech time-series feature quantity linear compression part
40 Search condition input section


